CrowdTruth for Temporal Event Ordering Annotation

This analysis uses the data gathered in the "Event Annotation" crowdsourcing experiment published in Rion Snow, Brendan O’Connor, Dan Jurafsky, and Andrew Y. Ng: Cheap and fast—but is it good? Evaluating non-expert annotations for natural language tasks. EMNLP 2008, pages 254–263.

Task Description: Given two events in a text, the crowd has to choose whether the first event happened "strictly before" or "strictly after" the second event. Below is an example from the aforementioned publication:

Text: “It just blew up in the air, and then we saw two fireballs go down to the, to the water, and there was a big small, ah, smoke, from ah, coming up from that”.

Events: go/coming, or blew/saw

A screenshot of the task as it appeared to workers can be seen at the following repository.

The dataset for this task was downloaded from the following repository, which contains the raw output from the crowd on AMT. Currently, you can find the processed input file in the folder named data. Besides the raw crowd annotations, the processed file also contains the sentence and the two events that were given as input to the crowd (for part of the dataset).


In [1]:
import pandas as pd

test_data = pd.read_csv("../data/temp.standardized.csv")
test_data.head()


Out[1]:
!amt_annotation_ids !amt_worker_ids orig_id response gold start end event1 event2 text
0 1 A2HTGQE4AACVRV 42_0 before before Mon Mar 25 07:39:42 PDT 2019 Mon Mar 25 07:41:05 PDT 2019 NaN NaN NaN
1 2 AYHHOK9GDSWNH 42_0 before before Mon Mar 25 07:39:42 PDT 2019 Mon Mar 25 07:41:05 PDT 2019 NaN NaN NaN
2 3 A1QRQZWBL1SVEX 42_0 before before Mon Mar 25 07:39:42 PDT 2019 Mon Mar 25 07:41:05 PDT 2019 NaN NaN NaN
3 4 A3G0MGLBT484I1 42_0 before before Mon Mar 25 07:39:42 PDT 2019 Mon Mar 25 07:41:05 PDT 2019 NaN NaN NaN
4 5 A7NC1H5ZK7TO0 42_0 before before Mon Mar 25 07:39:42 PDT 2019 Mon Mar 25 07:41:05 PDT 2019 NaN NaN NaN

Declaring a pre-processing configuration

The pre-processing configuration defines how to interpret the raw crowdsourcing input. To do this, we need to define a configuration class. First, we import the default CrowdTruth configuration class:


In [2]:
import crowdtruth
from crowdtruth.configuration import DefaultConfig

Our test class inherits from the default configuration DefaultConfig, while also declaring some additional attributes that are specific to the Temporal Event Ordering task:

  • inputColumns: list of input columns from the .csv file with the input data
  • outputColumns: list of output columns from the .csv file with the answers from the workers
  • customPlatformColumns: a list of columns from the .csv file that define a standard annotation task, in the following order: judgment id, unit id, worker id, started time, submitted time. This variable is used for input files that do not come from AMT or FigureEight (formerly known as CrowdFlower).
  • annotation_separator: string that separates between the crowd annotations in outputColumns
  • open_ended_task: boolean variable defining whether the task is open-ended (i.e. the possible crowd annotations are not known beforehand, like in the case of free text input); in the task that we are processing, workers pick the answers from a pre-defined list, therefore the task is not open ended, and this variable is set to False
  • annotation_vector: list of possible crowd answers, mandatory to declare when open_ended_task is False; for our task, this is the list of relations
  • processJudgments: method that defines processing of the raw crowd data; for this task, we process the crowd answers to correspond to the values in annotation_vector

The complete configuration class is declared below:


In [3]:
class TestConfig(DefaultConfig):
    inputColumns = ["gold", "event1", "event2", "text"]
    outputColumns = ["response"]
    customPlatformColumns = ["!amt_annotation_ids", "orig_id", "!amt_worker_ids", "start", "end"]
    
    # processing of a closed task
    open_ended_task = False
    annotation_vector = ["before", "after"]
    
    def processJudgments(self, judgments):
        # pre-process output to match the values in annotation_vector
        for col in self.outputColumns:
            # transform to lowercase
            judgments[col] = judgments[col].apply(lambda x: str(x).lower())
        return judgments
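
As a minimal sketch of what the lowercasing in processJudgments achieves, the snippet below normalizes a few made-up responses and checks that they all fall inside annotation_vector. The toy values and the names raw, normalized and unknown are illustrative assumptions, not part of the dataset or of the CrowdTruth API:

```python
import pandas as pd

# closed set of answers, as declared in the configuration class
annotation_vector = ["before", "after"]

# made-up raw responses with inconsistent casing (illustrative only)
raw = pd.DataFrame({"response": ["Before", "AFTER", "before"]})

# same transformation as in processJudgments: cast to string, then lowercase
normalized = raw["response"].apply(lambda x: str(x).lower())

# any value left outside annotation_vector would point to a pre-processing problem
unknown = set(normalized) - set(annotation_vector)
print(list(normalized), unknown)
```

An empty unknown set means every response maps onto one of the declared annotations.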

Pre-processing the input data

After declaring the configuration of our input file, we are ready to pre-process the crowd data:


In [4]:
data, config = crowdtruth.load(
    file = "../data/temp.standardized.csv",
    config = TestConfig()
)

data['judgments'].head()


Out[4]:
output.response output.response.count output.response.unique started unit submitted worker duration job
judgment
1 {u'before': 1, u'after': 0} 1 2 2019-03-25 07:39:42-07:00 42_0 2019-03-25 07:41:05-07:00 A2HTGQE4AACVRV 83 ../data/temp.standardized
2 {u'before': 1, u'after': 0} 1 2 2019-03-25 07:39:42-07:00 42_0 2019-03-25 07:41:05-07:00 AYHHOK9GDSWNH 83 ../data/temp.standardized
3 {u'before': 1, u'after': 0} 1 2 2019-03-25 07:39:42-07:00 42_0 2019-03-25 07:41:05-07:00 A1QRQZWBL1SVEX 83 ../data/temp.standardized
4 {u'before': 1, u'after': 0} 1 2 2019-03-25 07:39:42-07:00 42_0 2019-03-25 07:41:05-07:00 A3G0MGLBT484I1 83 ../data/temp.standardized
5 {u'before': 1, u'after': 0} 1 2 2019-03-25 07:39:42-07:00 42_0 2019-03-25 07:41:05-07:00 A7NC1H5ZK7TO0 83 ../data/temp.standardized

Computing the CrowdTruth metrics

The pre-processed data can then be used to calculate the CrowdTruth metrics:


In [5]:
results = crowdtruth.run(data, config)

results is a dict object that contains the quality metrics for the sentences, annotations and crowd workers.

The sentence metrics are stored in results["units"]:


In [6]:
results["units"].head()


Out[6]:
duration input.event1 input.event2 input.gold input.text job output.response output.response.annotations output.response.unique_annotations worker uqs unit_annotation_score uqs_initial unit_annotation_score_initial
unit
100_1 83 NaN NaN before NaN ../data/temp.standardized {u'after': 3, u'before': 7} 10 2 10 0.720189 {u'after': 0.143649950855, u'before': 0.856350... 0.533333 {u'after': 0.3, u'before': 0.7}
102_0 83 NaN NaN after NaN ../data/temp.standardized {u'after': 7, u'before': 3} 10 2 10 0.613106 {u'after': 0.782360403297, u'before': 0.217639... 0.533333 {u'after': 0.7, u'before': 0.3}
102_1 83 NaN NaN before NaN ../data/temp.standardized {u'after': 3, u'before': 7} 10 2 10 0.700250 {u'after': 0.15658087953, u'before': 0.8434191... 0.533333 {u'after': 0.3, u'before': 0.7}
103_1 83 NaN NaN before NaN ../data/temp.standardized {u'after': 3, u'before': 7} 10 2 10 0.700250 {u'after': 0.15658087953, u'before': 0.8434191... 0.533333 {u'after': 0.3, u'before': 0.7}
103_11 83 NaN NaN before NaN ../data/temp.standardized {u'after': 2, u'before': 8} 10 2 10 0.785005 {u'after': 0.105941791355, u'before': 0.894058... 0.644444 {u'after': 0.2, u'before': 0.8}

The uqs column in results["units"] contains the sentence quality scores, capturing the overall worker agreement over each sentence. Here we plot its histogram:


In [7]:
import matplotlib.pyplot as plt
%matplotlib inline

plt.rcParams['figure.figsize'] = 15, 5

plt.subplot(1, 2, 1)
plt.hist(results["units"]["uqs"])
plt.ylim(0,200)
plt.xlabel("Sentence Quality Score")
plt.ylabel("#Sentences")

plt.subplot(1, 2, 2)
plt.hist(results["units"]["uqs_initial"])
plt.ylim(0,200)
plt.xlabel("Sentence Quality Score Initial")
plt.ylabel("# Units")


Out[7]:
Text(0,0.5,'# Units')
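
The uqs_initial values shown in the units table above appear consistent with plain pairwise agreement between the 10 workers of a unit: with 7 "before" and 3 "after" votes, 24 of the 45 worker pairs agree, which gives 0.533333. The sketch below reproduces this arithmetic. It is an illustration under that assumption, not the CrowdTruth implementation, and pairwise_agreement is a hypothetical helper:

```python
from itertools import combinations


def pairwise_agreement(votes):
    """Fraction of worker pairs that gave the same answer for one unit."""
    pairs = list(combinations(votes, 2))
    agreeing = sum(1 for a, b in pairs if a == b)
    return agreeing / float(len(pairs))


# vote counts taken from the units table: {after: 3, before: 7} and {after: 2, before: 8}
score_7_3 = pairwise_agreement(["before"] * 7 + ["after"] * 3)
score_8_2 = pairwise_agreement(["before"] * 8 + ["after"] * 2)
print(round(score_7_3, 6), round(score_8_2, 6))
```

The final uqs values differ from these numbers, which fits the idea that the final score also takes worker quality into account.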

Plot the change in unit quality score between the beginning of the process and the end:


In [8]:
import numpy as np

sortUQS = results["units"].sort_values(['uqs'], ascending=[1])
sortUQS = sortUQS.reset_index()

plt.rcParams['figure.figsize'] = 15, 5

plt.plot(np.arange(sortUQS.shape[0]), sortUQS["uqs_initial"], 'ro', lw = 1, label = "Initial UQS")
plt.plot(np.arange(sortUQS.shape[0]), sortUQS["uqs"], 'go', lw = 1, label = "Final UQS")

plt.ylabel('Sentence Quality Score')
plt.xlabel('Sentence Index')


Out[8]:
Text(0.5,0,'Sentence Index')

The unit_annotation_score column in results["units"] contains the sentence-annotation scores, capturing the likelihood that an annotation is expressed in a sentence. For each sentence, we store a dictionary mapping each annotation to its sentence-annotation score.


In [9]:
results["units"]["unit_annotation_score"].head()


Out[9]:
unit
100_1     {u'after': 0.143649950855, u'before': 0.856350...
102_0     {u'after': 0.782360403297, u'before': 0.217639...
102_1     {u'after': 0.15658087953, u'before': 0.8434191...
103_1     {u'after': 0.15658087953, u'before': 0.8434191...
103_11    {u'after': 0.105941791355, u'before': 0.894058...
Name: unit_annotation_score, dtype: object
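
For comparison, the unit_annotation_score_initial values in the units table match simple vote fractions, e.g. {after: 3, before: 7} becomes {after: 0.3, before: 0.7}. A minimal sketch of that computation follows; initial_annotation_score is a hypothetical helper written for illustration and is not a CrowdTruth function:

```python
def initial_annotation_score(counts):
    """Turn raw vote counts for one unit into per-annotation vote fractions."""
    total = float(sum(counts.values()))
    return {label: n / total for label, n in counts.items()}


# vote counts of unit 100_1 from the units table
fractions = initial_annotation_score({"after": 3, "before": 7})
print(fractions)
```

The final unit_annotation_score moves away from these fractions (0.856 instead of 0.7 for "before" on this unit), which suggests that votes are no longer counted equally once worker quality is estimated.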

Save unit metrics:


In [30]:
rows = []
header = ["orig_id", "gold", "text", "event1", "event2", "uqs", "uqs_initial", "before", "after", "before_initial", "after_initial"]

units = results["units"].reset_index()
for i in range(len(units.index)):
    row = [units["unit"].iloc[i], units["input.gold"].iloc[i], units["input.text"].iloc[i], units["input.event1"].iloc[i],\
           units["input.event2"].iloc[i], units["uqs"].iloc[i], units["uqs_initial"].iloc[i], \
           units["unit_annotation_score"].iloc[i]["before"], units["unit_annotation_score"].iloc[i]["after"], \
           units["unit_annotation_score_initial"].iloc[i]["before"], units["unit_annotation_score_initial"].iloc[i]["after"]]
    rows.append(row)
rows = pd.DataFrame(rows, columns=header)
rows.to_csv("../data/results/crowdtruth_units_temp.csv", index=False)

The worker metrics are stored in results["workers"]:


In [24]:
results["workers"].head()


Out[24]:
duration job judgment unit wqs wwa wsa wqs_initial wwa_initial wsa_initial
worker
A1123L7ANYUTG0 83 1 20 20 0.833171 0.858963 0.969973 0.676611 0.738889 0.915714
A11GX90QFWDLMM 83 1 462 462 0.282526 0.504831 0.559645 0.289752 0.489177 0.592326
A13PCLSK1JA8QL 83 1 10 10 0.212647 0.423645 0.501946 0.210866 0.400000 0.527164
A16QMNGIR7N53M 83 1 10 10 0.586793 0.702512 0.835277 0.454946 0.588889 0.772549
A17743NDSCO8P5 83 1 10 10 0.443041 0.625937 0.707804 0.393730 0.566667 0.694817

The wqs column in results["workers"] contains the worker quality scores, capturing the overall agreement between one worker and all the other workers.


In [25]:
plt.rcParams['figure.figsize'] = 15, 5

plt.subplot(1, 2, 1)
plt.hist(results["workers"]["wqs"])
plt.ylim(0,30)
plt.xlabel("Worker Quality Score")
plt.ylabel("#Workers")

plt.subplot(1, 2, 2)
plt.hist(results["workers"]["wqs_initial"])
plt.ylim(0,30)
plt.xlabel("Worker Quality Score Initial")
plt.ylabel("#Workers")


Out[25]:
Text(0,0.5,'#Workers')
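
In the worker table above, wqs matches the product of the wwa and wsa columns up to rounding. The short check below uses the five rows printed there; it only verifies that relation on the displayed values and makes no claim about how the library computes the individual terms:

```python
# (wqs, wwa, wsa) copied from the first five rows of results["workers"] shown above
rows = [
    (0.833171, 0.858963, 0.969973),
    (0.282526, 0.504831, 0.559645),
    (0.212647, 0.423645, 0.501946),
    (0.586793, 0.702512, 0.835277),
    (0.443041, 0.625937, 0.707804),
]

# largest absolute difference between the reported wqs and wwa * wsa
max_gap = max(abs(wqs - wwa * wsa) for wqs, wwa, wsa in rows)
print(max_gap < 1e-5)
```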

Save the worker metrics:


In [26]:
results["workers"].to_csv("../data/results/crowdtruth_workers_temp.csv", index=True)

The annotation metrics are stored in results["annotations"]. The aqs column contains the annotation quality scores, capturing the overall worker agreement over one relation.


In [27]:
results["annotations"]


Out[27]:
output.response aqs aqs_initial
after 4620 0.836744 0.723143
before 4620 0.824213 0.713564

In [28]:
import numpy as np

sortedUQS = results["units"].sort_values(["uqs"])
# remove the units for which we don't have the events and the text
sortedUQS = sortedUQS.dropna()

Example of a very clear unit


In [17]:
sortedUQS.tail(1)


Out[17]:
duration input.event1 input.event2 input.gold input.text job output.response output.response.annotations output.response.unique_annotations worker uqs unit_annotation_score uqs_initial unit_annotation_score_initial
unit
10_3 83 <font color="blue">acquire</font> <font color="purple">reached</font> after <p>Ratners Group PLC's U.S. subsidiary has agr... ../data/temp.standardized {u'after': 10, u'before': 0} 10 1 10 1.0 {u'after': 1.0, u'before': 0.0} 1.0 {u'after': 1.0, u'before': 0.0}

In [18]:
print("Text: %s" % sortedUQS["input.text"].iloc[-1])
print("\n Event1: %s" % sortedUQS["input.event1"].iloc[-1])
print("\n Event2: %s" % sortedUQS["input.event2"].iloc[-1])
print("\n Expert Answer: %s" % sortedUQS["input.gold"].iloc[-1])
print("\n Crowd Answer with CrowdTruth: %s" % sortedUQS["unit_annotation_score"].iloc[-1])
print("\n Crowd Answer without CrowdTruth: %s" % sortedUQS["unit_annotation_score_initial"].iloc[-1])


Text: <p>Ratners Group PLC's U.S. subsidiary has agreed to <b><u><font color="blue">acquire</font></u></b> jewelry retailer Weisfield's Inc. for $50 a share, or about $55 million. <br></br>Weisfield's shares <b><u><font color="green">soared</font></u></b> on the announcement yesterday, closing up $11 to <b><u><font color="orange">close</font></u></b> at $50 in national over-the-counter trading. <br></br>Ratners and Weisfield's <b><u><font color="red">said</font></u></b> they <b><u><font color="purple">reached</font></u></b> an agreement in principle for the acquisition of Weisfield's by Sterling Inc. <br></br>The companies <b><u><font color="brown">said</font></u></b> the acquisition is subject to a definitive agreement. <br></br>They said they expect the transaction to be completed by Dec. 15. <br></br>Weisfield's, based in Seattle, Wash., currently operates 87 specialty jewelry stores in nine states. <br></br>In the fiscal year ended Jan. 31, the company reported sales of $59.5 million and pretax profit of $2.9 million. <br></br>Ratners, which controls 25%: of the British jewelry market, would increase the number of its U.S. stores to about 450 stores from 360. <br></br>It has said it hopes to control 5%: of jewelry business in the U.S. by 1992; currently it controls about 2%:. <br></br></p>

 Event1: <font color="blue">acquire</font>

 Event2: <font color="purple">reached</font>

 Expert Answer: after

 Crowd Answer with CrowdTruth: Counter({'after': 1.0, 'before': 0.0})

 Crowd Answer without CrowdTruth: Counter({'after': 1.0, 'before': 0.0})

Example of an unclear unit


In [19]:
sortedUQS.head(1)


Out[19]:
duration input.event1 input.event2 input.gold input.text job output.response output.response.annotations output.response.unique_annotations worker uqs unit_annotation_score uqs_initial unit_annotation_score_initial
unit
6_12 83 <font color="red">turn</font> <font color="purple">said</font> before <p>Magna International Inc..'s chief financial... ../data/temp.standardized {u'after': 5, u'before': 5} 10 2 10 0.436384 {u'after': 0.475324306685, u'before': 0.524675... 0.444444 {u'after': 0.5, u'before': 0.5}

In [20]:
print("Text: %s" % sortedUQS["input.text"].iloc[0])
print("\n Event1: %s" % sortedUQS["input.event1"].iloc[0])
print("\n Event2: %s" % sortedUQS["input.event2"].iloc[0])
print("\n Expert Answer: %s" % sortedUQS["input.gold"].iloc[0])
print("\n Crowd Answer with CrowdTruth: %s" % sortedUQS["unit_annotation_score"].iloc[0])
print("\n Crowd Answer without CrowdTruth: %s" % sortedUQS["unit_annotation_score_initial"].iloc[0])


Text: <p>Magna International Inc..'s chief financial officer, James McAlpine, <b><u><font color="blue">resigned</font></u></b> and its chairman, Frank Stronach, is <b><u><font color="green">stepping</font></u></b> in to <b><u><font color="orange">help</font></u></b> <b><u><font color="red">turn</font></u></b> the automotive-parts manufacturer around, the company <b><u><font color="purple">said</font></u></b>. <br></br>Mr. Stronach will <b><u><font color="brown">direct</font></u></b> an effort to reduce overhead and curb capital spending'' until a more satisfactory level of profit is achieved and maintained,'' Magna said. <br></br>Stephen Akerfeldt, currently vice president finance, will succeed Mr. McAlpine. <br></br>An ambitious expansion has left Magna with excess capacity and a heavy debt load as the automotive industry enters a downturn. <br></br>The company has reported declines in operating profit in each of the past three years, despite steady sales growth. <br></br>Magna recently cut its quarterly dividend in half and the company's Class A shares are wallowing far below their 52-week high of 16.125 Canadian dollars -LRB- US $13.73 -RRB-. <br></br>On the Toronto Stock Exchange yesterday, Magna shares closed up 37.5 Canadian cents to C$9.625. <br></br>Mr. Stronach, founder and controlling shareholder of Magna, resigned as chief executive officer last year to seek, unsuccessfully, a seat in Canada's Parliament. <br></br>Analysts said Mr. Stronach wants to resume a more influential role in running the company. <br></br>They expect him to cut costs throughout the organization. <br></br>The company said Mr. Stronach will personally direct the restructuring, assisted by Manfred Gingl, president and chief executive. <br></br>Neither they nor Mr. McAlpine could be reached for comment. <br></br>Magna said Mr. McAlpine resigned to pursue a consulting career, with Magna as one of his clients. <br></br></p>

 Event1: <font color="red">turn</font>

 Event2: <font color="purple">said</font>

 Expert Answer: before

 Crowd Answer with CrowdTruth: Counter({'before': 0.5246756933153717, 'after': 0.47532430668462844})

 Crowd Answer without CrowdTruth: Counter({'after': 0.5, 'before': 0.5})

MACE for Temporal Event Ordering Annotation

We first pre-processed the crowd results to create files compatible with the MACE tool. Each row in the csv file corresponds to a unit in the dataset and each column corresponds to a worker. Each cell holds the worker's answer for that unit (or is empty if the worker did not annotate that unit).
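
The conversion itself is not shown in this notebook. One way to obtain such a unit-by-worker layout, sketched here on made-up judgments, is a pandas pivot followed by a header-less csv export. The column names are borrowed from the input file above, while the worker ids W1 and W2 and the variable names are assumptions for illustration:

```python
import pandas as pd

# made-up judgments in long format: one row per (unit, worker) answer
judgments = pd.DataFrame({
    "orig_id": ["42_0", "42_0", "11_1"],
    "!amt_worker_ids": ["W1", "W2", "W2"],
    "response": ["before", "after", "before"],
})

# one row per unit, one column per worker; missing answers become NaN
wide = judgments.pivot(index="orig_id", columns="!amt_worker_ids", values="response")

# export without header or index; NaN cells are written as empty fields
mace_csv = wide.to_csv(header=False, index=False)
print(mace_csv)
```

Passing a file path as the first argument of to_csv would write the same content to disk instead of returning it as a string.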


In [29]:
import numpy as np

test_data = pd.read_csv("../data/mace_temp.standardized.csv", header=None)
test_data = test_data.replace(np.nan, '', regex=True)
test_data.head()


Out[29]:
0 1 2 3 4 5 6 7 8 9 ... 66 67 68 69 70 71 72 73 74 75
0 before before before before before before after after before after ...
1 before before after after after ...
2 after before after after after ...
3 after before after after after ...
4 before before after after after ...

5 rows × 76 columns


In [31]:
import pandas as pd

mace_data = pd.read_csv("../data/results/mace_units_temp.csv")
mace_data.head()


Out[31]:
unit before after
0 42_0 1.000000e+00 2.587036e-09
1 11_1 9.999999e-01 1.071343e-07
2 11_0 5.038665e-08 9.999999e-01
3 112_0 5.480848e-08 9.999999e-01
4 9_2 1.000000e+00 2.005877e-08

In [32]:
mace_workers = pd.read_csv("../data/results/mace_workers_temp.csv")
mace_workers.head()


Out[32]:
worker competence
0 A2HTGQE4AACVRV 0.942213
1 AYHHOK9GDSWNH 0.928760
2 A1QRQZWBL1SVEX 0.942213
3 A3G0MGLBT484I1 0.000667
4 A7NC1H5ZK7TO0 0.942213

CrowdTruth vs. MACE on Worker Quality


In [33]:
mace_workers = pd.read_csv("../data/results/mace_workers_temp.csv")
crowdtruth_workers = pd.read_csv("../data/results/crowdtruth_workers_temp.csv")

mace_workers = mace_workers.sort_values(["worker"])
crowdtruth_workers = crowdtruth_workers.sort_values(["worker"])

In [34]:
%matplotlib inline

import matplotlib
import matplotlib.pyplot as plt

plt.scatter(
    mace_workers["competence"],
    crowdtruth_workers["wqs"],
)

plt.title("Worker Quality Score")
plt.xlabel("MACE")
plt.ylabel("CrowdTruth")


Out[34]:
Text(0,0.5,'CrowdTruth')
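
The scatter plot above pairs the two score columns by position, which works only if both files list exactly the same workers. A more defensive variant, sketched below on made-up scores, joins the two tables on the worker id first and then computes a Pearson correlation. The toy worker ids and values are assumptions and do not come from the result files:

```python
import pandas as pd

# made-up scores; the real ones are read from the two result csv files
mace_toy = pd.DataFrame({"worker": ["w1", "w2", "w3"],
                         "competence": [0.9, 0.1, 0.5]})
ct_toy = pd.DataFrame({"worker": ["w3", "w1", "w2"],
                       "wqs": [0.4, 0.8, 0.2]})

# align the two score sets on the worker id rather than on row position
merged = mace_toy.merge(ct_toy, on="worker", how="inner")

# Pearson correlation between the two worker quality estimates
pearson = merged["competence"].corr(merged["wqs"])
print(merged)
print(pearson)
```

With an inner join, workers present in only one of the two files are dropped instead of silently shifting the pairing.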

In [35]:
sortWQS = crowdtruth_workers.sort_values(['wqs'], ascending=[1])
sortWQS = sortWQS.reset_index()
worker_ids = list(sortWQS["worker"])

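mace_workers = mace_workers.set_index('worker')
# reorder the MACE scores to follow the CrowdTruth ranking; the result has to be assigned back
mace_workers = mace_workers.loc[worker_ids]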

plt.rcParams['figure.figsize'] = 15, 5

plt.plot(np.arange(sortWQS.shape[0]), sortWQS["wqs"], 'bo', lw = 1, label = "CrowdTruth Worker Score")
plt.plot(np.arange(mace_workers.shape[0]), mace_workers["competence"], 'go', lw = 1, label = "MACE Worker Score")

plt.ylabel('Worker Quality Score')
plt.xlabel('Worker Index')
plt.legend()


Out[35]:
<matplotlib.legend.Legend at 0x10faea7d0>

CrowdTruth vs. MACE vs. Majority Vote on Annotation Performance


In [46]:
mace = pd.read_csv("../data/results/mace_units_temp.csv")
# note: this rebinds the name `crowdtruth` (the module imported in In [2]) to a DataFrame
crowdtruth = pd.read_csv("../data/results/crowdtruth_units_temp.csv")

In [39]:
def compute_F1_score(dataset):
    nyt_f1 = np.zeros(shape=(100, 2))
    # sweep thresholds 0.01 .. 1.00; range works in both Python 2 and Python 3
    for idx in range(0, 100):
        thresh = (idx + 1) / 100.0
        tp = 0
        fp = 0
        tn = 0
        fn = 0

        for gt_idx in range(0, len(dataset.index)):
            if dataset['after'].iloc[gt_idx] >= thresh:
                if dataset['gold'].iloc[gt_idx] == 'after':
                    tp = tp + 1.0
                else:
                    fp = fp + 1.0
            else:
                if dataset['gold'].iloc[gt_idx] == 'after':
                    fn = fn + 1.0
                else:
                    tn = tn + 1.0


        nyt_f1[idx, 0] = thresh
    
        if tp != 0:
            nyt_f1[idx, 1] = 2.0 * tp / (2.0 * tp + fp + fn)
        else:
            nyt_f1[idx, 1] = 0
    return nyt_f1


def compute_majority_vote(dataset, crowd_column):
    # note: crowd_column is not used; the vote is always read from the 'after_initial' column
    tp = 0
    fp = 0
    tn = 0
    fn = 0
    
    for j in range(len(dataset.index)):
        if dataset['after_initial'].iloc[j] >= 0.5:
            if dataset['gold'].iloc[j] == 'after':
                tp = tp + 1.0
            else:
                fp = fp + 1.0
        else:
            if dataset['gold'].iloc[j] == 'after':
                fn = fn + 1.0
            else:
                tn = tn + 1.0
    return 2.0 * tp / (2.0 * tp + fp + fn)
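
Both functions rely on the same formula, F1 = 2·tp / (2·tp + fp + fn), with "after" as the positive class. The sketch below restates it in vectorized form for a single threshold on made-up data; f1_at_threshold and the toy values are illustrative assumptions, not a replacement for the cells above:

```python
import numpy as np
import pandas as pd


def f1_at_threshold(scores, gold, thresh, positive="after"):
    """F1 for the positive class when predicting it for scores >= thresh."""
    predicted = np.asarray(scores) >= thresh
    actual = np.asarray(gold) == positive
    tp = np.sum(predicted & actual)
    fp = np.sum(predicted & ~actual)
    fn = np.sum(~predicted & actual)
    denom = 2.0 * tp + fp + fn
    # guard against an empty denominator (no positives predicted or present)
    return 2.0 * tp / denom if denom else 0.0


# made-up scores for "after" and expert labels
toy = pd.DataFrame({"after": [0.9, 0.6, 0.4, 0.1],
                    "gold": ["after", "before", "after", "before"]})

f1_half = f1_at_threshold(toy["after"], toy["gold"], 0.5)  # tp=1, fp=1, fn=1
f1_low = f1_at_threshold(toy["after"], toy["gold"], 0.3)   # tp=2, fp=1, fn=0
print(f1_half, f1_low)
```

On this toy data, lowering the threshold from 0.5 to 0.3 recovers the missed "after" unit and raises F1 from 0.5 to 0.8, which is the kind of trade-off the threshold sweep in compute_F1_score explores.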

In [43]:
F1_crowdtruth = compute_F1_score(crowdtruth)
print(F1_crowdtruth[F1_crowdtruth[:,1].argsort()][-10:])


[[0.46       0.9352518 ]
 [0.5        0.9368932 ]
 [0.51       0.9368932 ]
 [0.52       0.9368932 ]
 [0.53       0.9368932 ]
 [0.54       0.9368932 ]
 [0.45       0.93779904]
 [0.44       0.93779904]
 [0.49       0.93946731]
 [0.48       0.93975904]]

In [47]:
F1_mace = compute_F1_score(mace)
print(F1_mace[F1_mace[:,1].argsort()][-10:])


[[0.14       0.93430657]
 [0.15       0.93430657]
 [0.16       0.93430657]
 [0.17       0.93430657]
 [0.18       0.93430657]
 [0.19       0.93430657]
 [0.21       0.93430657]
 [0.1        0.93430657]
 [0.2        0.93430657]
 [0.05       0.9368932 ]]

In [48]:
F1_majority_vote = compute_majority_vote(crowdtruth, 'value')
F1_majority_vote


Out[48]:
0.05240174672489083
